When Cross-Validation Breaks Inferences

statistics
Author: Gabriel Lewis
Published: July 10, 2024

Often, we don’t know which model to use for inference, so we select the model that makes the most accurate predictions on held-out data1 and then use that model to draw inferences from all of our data — e.g. compute confidence intervals and run hypothesis tests. Perhaps surprisingly, this standard procedure can go badly awry. The very act of selecting our model for its predictive accuracy can formally invalidate inferences from that model — even if our model was correctly selected from a set of candidate models that all deliver valid inferences. If you find this claim bizarre, suspicious, or profoundly annoying, you’re not alone.

The problem of “post-selection inference,” sometimes called the “specification test paradox,” has bothered theorists for a long time.2 In principle, it presents us with a serious dilemma: either a large fraction of real-world statistical inferences are formally invalidated by model selection, or the prevailing frequentist theory of statistical inference conflicts with sound scientific practices.

But most applied researchers I talk to about this problem are convinced that a) it doesn’t really happen in practice, or b) it might happen, but it doesn’t matter. So I’m doing some simulations to show that at least it does happen: in very simple and plausible scenarios, model selection can substantially invalidate frequentist inferences.

Here’s a general explanation of how this seeming paradox happens. A valid confidence interval is designed to capture the truth 95% of the time, unconditionally — that is, not conditioning on the result of any other procedure applied to the data. However, when we calculate confidence intervals after model selection, we are in fact calculating confidence intervals conditional on the model-selection procedure. This conditioning generally breaks the 95% confidence guarantee, for the basic reason that conditional probability distributions don’t have to resemble unconditional ones.
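
In symbols: write \(CI_k\) for the 95% interval computed from candidate model \(k\) on the full data, and let \(S = k\) denote the event that model \(k\) wins the selection step. By the law of total probability, the coverage of the interval we actually report is

\(\begin{aligned} P(\theta \in CI_S) = \sum_k P(\theta \in CI_k \mid S = k)\,P(S = k). \end{aligned}\)

Each \(CI_k\) is constructed so that \(P(\theta \in CI_k) = 0.95\) unconditionally, but nothing in that construction pins down the conditional terms \(P(\theta \in CI_k \mid S = k)\): selection tends to favor a model precisely in the datasets that flatter it, and in those datasets its interval can miss more often.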

I suspect that this argument alone won’t convince most people that there’s a problem, because our intuition says that selecting the model that predicts best should produce better inferences. Certainly, standard practice in statistics, econometrics, and data science is to apply a battery of predictive tests, and to feel much better about the model that passes the tests. At worst, our intuition says, after model selection our 95% CI should end up covering the truth more than 95% of the time.

Well, in this case our intuitions and standard practices are demonstrably wrong.

Simulations

We’re going to simulate a very straightforward situation where the true model is

\(\begin{aligned} y = x+\epsilon\\ \epsilon \sim N(0,1) \end{aligned}\)

and we’re comparing two candidate models

  1. \(y = \alpha +\beta x+\epsilon\)

  2. \(y = \alpha +\beta x+\gamma x^2 + \epsilon\)

Notice that both candidate models encompass the true model (the truth corresponds to \(\alpha = 0\) and \(\beta = 1\), with \(\gamma = 0\) in Model 2), so both are in a sense correct, but Model 1 is simpler, so it is in a sense more correct.

We’ll use standard 10-fold cross-validation, via the modelr package because I’m lazy. Whichever model has the lower average MSE across the 10 folds will be our “selected model.”

This code isn’t hyper-optimized or elegant; but then again, nor am I.

Code
library(tidyverse)
library(estimatr)
library(kableExtra)
library(modelr)
Code
set.seed(350857) #For reproducibility!
n_data <- 200    #Sample size in each dataset
n_sims <- 4000   #Number of simulations
true_beta <- 1   #True parameter being estimated

#Initialize data frame to hold the Monte Carlo simulation results
data_out <- tibble(
  model1_beta_lo = rep(NA, n_sims),
  model1_beta_hi = rep(NA, n_sims),
  model2_beta_lo = rep(NA, n_sims),
  model2_beta_hi = rep(NA, n_sims),
  chosen_beta_lo = rep(NA, n_sims),
  chosen_beta_hi = rep(NA, n_sims),
  relevance_var  = rep(NA, n_sims)
)

#Run simulation as a for-loop
system.time(
  for (i in 1:n_sims) {

    #Generate data from the true model: Y = X + noise
    data_i <- tibble(
      X = rnorm(n_data) + 5,
      Y = true_beta * X + rnorm(n_data) #Outcome
    )

    #Split data into 10 folds ("train"/"test" pairs) for cross-validation
    data_i_cv <- modelr::crossv_kfold(data_i, k = 10)

    #Fit model 1 on each training fold
    model1_cv <- map(data_i_cv$train,
                     ~lm_robust(Y ~ X,
                                se_type = "HC3",
                                data = .)
                     )

    #Calculate model 1 MSE on each test fold, averaged across folds
    model1_mse <- map2_dbl(model1_cv, data_i_cv$test, mse) |> mean()

    #Fit model 2 on each training fold
    model2_cv <- map(data_i_cv$train,
                     ~lm_robust(Y ~ X + I(X^2),
                                se_type = "HC3",
                                data = .)
                     )

    #Calculate model 2 MSE on each test fold, averaged across folds
    model2_mse <- map2_dbl(model2_cv, data_i_cv$test, mse) |> mean()

    #Fit both models on the full dataset and store the 95% CI for beta
    model1 <- lm_robust(Y ~ X,
                        se_type = "HC3",
                        data = data_i)

    data_out[i, "model1_beta_lo"] <- model1$conf.low[2]
    data_out[i, "model1_beta_hi"] <- model1$conf.high[2]

    model2 <- lm_robust(Y ~ X + I(X^2),
                        se_type = "HC3",
                        data = data_i)

    data_out[i, "model2_beta_lo"] <- model2$conf.low[2]
    data_out[i, "model2_beta_hi"] <- model2$conf.high[2]

    #Select the model with the lowest cross-validated MSE
    if (model1_mse < model2_mse) {
      data_out[i, "chosen_beta_lo"] <- model1$conf.low[2]
      data_out[i, "chosen_beta_hi"] <- model1$conf.high[2]
    } else {
      data_out[i, "chosen_beta_lo"] <- model2$conf.low[2]
      data_out[i, "chosen_beta_hi"] <- model2$conf.high[2]
    }

    #Record which model was selected (1 = Model 1, 0 = Model 2)
    data_out[i, "relevance_var"] <- 1 * (model1_mse < model2_mse)

  }
)
   user  system elapsed 
100.528   0.943 101.512 
Code
#Check whether CIs have covered true beta
data_out <- data_out %>% rowwise() %>%
  mutate(
    covered_mod1 = between(true_beta, model1_beta_lo, model1_beta_hi),
    covered_mod2 = between(true_beta, model2_beta_lo, model2_beta_hi),
    covered_chosen = between(true_beta, chosen_beta_lo, chosen_beta_hi)
  )

#Table of outputs
tribble(
  ~Model, ~`Coverage Rate`,
  "Model 1", data_out$covered_mod1 |> mean() |> round(2),
  "Model 2", data_out$covered_mod2 |> mean() |> round(2),
  "Selected Model", data_out$covered_chosen |> mean() |> round(2),
) |> 
  kable()
Model            Coverage Rate
Model 1                   0.95
Model 2                   0.95
Selected Model            0.91

Above, we can verify that the coverage rate for each model’s 95% CI is indeed 95%, so both Model 1 and Model 2 produce valid inferences on their own; both are in a sense correct.

However, the model selected by cross-validation has a coverage rate of around 91%, meaning that it captures the true parameter substantially less often than it should. This is as though you claimed to have found a p-value of p=0.05 when you had really found p=0.09; in academic circles, that could be considered a major error, or even fraud.
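
If you suspect that the gap is just simulation noise, a quick back-of-the-envelope check (a sketch I'm adding here, not part of the simulation above) says otherwise: treating each simulation as an independent Bernoulli draw, the Monte Carlo standard error of the estimated coverage rate is on the order of half a percentage point.

Code
#Rough Monte Carlo standard error of the selected model's estimated coverage rate
p_hat  <- 0.91   #estimated coverage of the selected model (from the table above)
n_sims <- 4000   #number of simulations
mc_se  <- sqrt(p_hat * (1 - p_hat) / n_sims)  #about 0.0045
(0.95 - p_hat) / mc_se                        #the shortfall is many standard errors wide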

If this discrepancy still doesn’t impress you, then rest assured that with some extra effort and computation time I won’t expend right now, we could tweak the simulation to make this discrepancy worse — by increasing the number of candidate models, for example.

To understand why the selected model’s coverage rate is lower than it ought to be, let’s consider the coverage rate of each model, given that it is selected:

Code
tribble(
  ~Model, ~`Coverage Rate`,
  "Model 1 - when selected", data_out |> filter(relevance_var == 1) |> pull(covered_mod1) |> mean() |> round(2),
  "Model 2 - when selected", data_out |> filter(relevance_var == 0) |> pull(covered_mod2) |> mean() |> round(2),
) |> 
  kable()
Model                     Coverage Rate
Model 1 - when selected            0.95
Model 2 - when selected            0.70

Model 1 has the correct coverage rate given that it is selected, but Model 2 has a very low coverage rate given that it is selected; so it is the cases in which Model 2 is selected that drag down the average coverage rate of the selected model. This happens, roughly, because Model 2 out-predicts Model 1 (and hence is selected) only in the unusual datasets where chance has conspired to make \(x^2\) (which Model 2 includes but Model 1 does not) a spuriously good predictor, even though \(x^2\) is not a relevant predictor under the true data-generating process.
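
How much weight those selected-Model-2 cases carry depends on how often Model 2 actually wins the cross-validation. The relevance_var column recorded above makes this easy to check, if you're curious (output not shown here):

Code
#Fraction of simulations in which each model won the cross-validation
mean(data_out$relevance_var)      #share of simulations where Model 1 was selected
1 - mean(data_out$relevance_var)  #share where Model 2 was selected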

Particulars aside, it’s worth emphasizing that this problem is in principle very general, not just restricted to cross-validation or the models given above. It also affects hypothesis tests. In theory, most of the robustness checks, specification tests, etc. that researchers use to guide model building and model selection invalidate our statistical inferences in the way demonstrated above.

A quick answer

If you have lots of independent data, there’s a straightforward way to address the specification test problem: randomly split your sample into two subsamples. Use the first subsample to select your model, and the second subsample to fit your selected model and draw inferences. If your two subsamples are truly independent, then by definition, conditioning on any function of your first subsample (e.g. a model selection test statistic) cannot affect the distribution of any function of your second subsample (e.g. your confidence interval). So the distribution of your 95% confidence interval, given that it came from model selection, is equal to its unconditional (“marginal”) distribution, in which it does theoretically have a 95% chance of capturing the truth: the 95% guarantee now holds after model selection.
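
Here is a minimal sketch of what that could look like in R, reusing the notation from the simulation above (the even 50/50 split and the code itself are my illustration, not something run in this post):

Code
#Hypothetical sample-splitting sketch: select on one half, infer on the other
split_id   <- sample(rep(c(TRUE, FALSE), length.out = nrow(data_i)))
select_dat <- data_i[split_id, ]   #used only for model selection
infer_dat  <- data_i[!split_id, ]  #used only for fitting and confidence intervals

#Select the model by 10-fold cross-validation on the selection half
cv_select <- modelr::crossv_kfold(select_dat, k = 10)
mse1 <- map2_dbl(map(cv_select$train, ~lm_robust(Y ~ X, se_type = "HC3", data = .)),
                 cv_select$test, mse) |> mean()
mse2 <- map2_dbl(map(cv_select$train, ~lm_robust(Y ~ X + I(X^2), se_type = "HC3", data = .)),
                 cv_select$test, mse) |> mean()

#Fit only the selected model on the untouched inference half
chosen_formula <- if (mse1 < mse2) Y ~ X else Y ~ X + I(X^2)
final_fit <- lm_robust(chosen_formula, se_type = "HC3", data = infer_dat)
final_fit$conf.low[2]   #lower bound of the 95% CI for beta
final_fit$conf.high[2]  #upper bound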

But what fraction of our data should we allocate to each subsample? Well, we face a tradeoff between selecting a predictively accurate model (devoting more data to the first subsample), and drawing precise inferences from whichever model we select (devoting more data to the second subsample). There’s no straightforward answer to this question.
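
One way to put a rough number on the inference-side cost, under the usual assumption that standard errors scale like \(1/\sqrt{n}\): reserving half the data for selection inflates the standard errors of the final fit by roughly

\(\begin{aligned} \sqrt{\frac{n}{n/2}} = \sqrt{2} \approx 1.41, \end{aligned}\)

so the confidence intervals from the inference half are about 40% wider than full-sample intervals would be; that is the price of making the 95% guarantee hold after selection.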

However we split the data, this procedure has the interesting feature that two different researchers who make different sample splits could in principle arrive at quite different inferences, despite beginning with the same data.

Not so fast

Data is (or ought to be) used by more than one research team, and researchers respond to each other’s work — that’s the essence of scientific inquiry. So the data that one researcher holds out for model selection will often be used by another researcher for model inference; and the model that one researcher selects will affect the model that another researcher selects.

This means that in practice, there will often be a statistical dependency between model selection and model inference in any given field of research, even when each individual researcher is rigorously splitting his or her sample. At least in principle, this dependency creates the same problem that we have just documented above.

Conclusion

Is this a problem with prevailing scientific practice, or is it really a problem with frequentist inference? Or not a problem at all? Personally, I have trouble deciding. A theory of inference that conflicts too strongly with successful scientific practices casts itself in doubt. But in the social sciences, where model selection seems to be the biggest problem, how successful are our scientific practices, really? Unlike in, say, engineering, when we are wrong, things don’t necessarily fall apart; biased estimates can be replicated by biased estimates, and economies can thrive despite clueless economists. Or so we must hope.

Footnotes

  1. E.g. select the model that has the lowest mean-squared error in k-fold cross-validation.↩︎

  2. For recent work and a review, see Kuchibhotla et al. (2020). Notice that their proposed solutions really don’t cover most of the things that people actually do to their data.↩︎


License: Attribution-NonCommercial 4.0 International. (You may use my work only with proper citation and for non-commercial purposes.)